A Software Infrastructure for the CLEENEX Optimizer

Authors

  • Helena Isabel de Jesus Galhardas
  • Mário Jorge Costa
  • Gaspar da Silva
  • Pável Pereira Calado
Abstract

Data quality problems are a growing concern. This document focuses on one specific data quality problem: the existence of approximate duplicate records. Data cleaning aims to correct data quality problems, which arise in a variety of situations, and several data cleaning tools address them. One task of a data cleaning program is approximate duplicate detection. This task must be efficient: on a large amount of data, comparing every record with every other record becomes a performance bottleneck. The goal of the optimizer in a data cleaning tool is to build several execution plans for the data cleaning program and, based on the cost of each plan, choose the most efficient one. Such an optimizer needs a software infrastructure to support it. In particular, this infrastructure must provide several alternatives that improve the efficiency of approximate duplicate detection. In this thesis, we designed and implemented an infrastructure to support an optimizer for CLEENEX, a data cleaning tool. We also describe the methodology used to validate the implemented infrastructure.
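The idea in the abstract, generating alternative plans for approximate duplicate detection and picking the cheapest, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from CLEENEX: the function names, the first-letter blocking key, the similarity threshold, and the toy cost model (number of candidate comparisons) are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.8):
    """Approximate string match using a similarity ratio (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def naive_pairs(records):
    """Exhaustive plan: every pair of records is a candidate."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key=lambda r: r[:1].lower()):
    """Blocking plan: only records sharing a blocking key are compared.

    The first-letter key is a hypothetical choice, not CLEENEX's method.
    """
    blocks = defaultdict(list)
    for i, record in enumerate(records):
        blocks[key(record)].append(i)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return sorted(pairs)


def choose_plan(records, plans):
    """Toy optimizer: cost of a plan = number of candidate comparisons."""
    costs = {name: len(fn(records)) for name, fn in plans.items()}
    best = min(costs, key=costs.get)
    return best, costs


def detect_duplicates(records, pair_fn):
    """Run the chosen plan and keep the pairs that match approximately."""
    return [
        (records[i], records[j])
        for i, j in pair_fn(records)
        if similar(records[i], records[j])
    ]


RECORDS = ["John Smith", "Jon Smith", "Maria Costa", "Mario Costa", "Ana Silva"]
PLANS = {"naive": naive_pairs, "blocked": blocked_pairs}

if __name__ == "__main__":
    best, costs = choose_plan(RECORDS, PLANS)
    print(best, costs)
    print(detect_duplicates(RECORDS, PLANS[best]))
```

The trade-off in this sketch is that blocking reduces the number of comparisons but can miss duplicates whose blocking keys differ (for example, a typo in the first letter), so a realistic cost model would likely weigh efficiency against match quality.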


Similar resources

Experience in Testing Compiler Optimizers Using Comparison Checking

This paper describes our experience of testing and debugging an optimizer using comparison checking. Although this study is based on Jaramillo et al.’s work, the experience will help those who test optimizers using this technique. In our implementation, important values during the execution of programs are output as a file trace before and after each optimization. Then a comparison phase checks...


A Status Report on XXL - a Software Infrastructure for Efficient Query Processing

XXL is a Java library that contains a rich infrastructure for implementing advanced query processing functionality. The library offers low-level components like access to raw disks as well as high-level ones like a query optimizer. On the intermediate levels, XXL provides a demand-driven cursor algebra, a framework for indexing and a powerful package for supporting aggregation. The library is p...


Healthcare Districting Optimization Using Gray Wolf Optimizer and Ant Lion Optimizer Algorithms (case study: South Khorasan Healthcare System in Iran)

In this paper, the problem of population districting in the health system of South Khorasan province has been investigated in the form of an optimization problem. Now that the districting problem is considered as a strategic matter, it is vital to obtain efficient solutions in order to implement in the system. Therefore in this study two meta-heuristic algorithms, Ant Lion Optimizer (ALO) and G...


Strata: A Software Dynamic Translation Infrastructure

Software dynamic translation is the alteration of a running program to achieve a specific objective. For example, a dynamic optimizer uses software dynamic translation to modify a running program with the objective of making the program run faster. In addition to its demonstrated utility in dynamic optimizers, software dynamic translation also shows promise for producing applications that are a...


MAO - An extensible micro-architectural optimizer

Performance matters, and so does repeatability and predictability. Today’s processors’ micro-architectures have become so complex as to now contain many undocumented, not understood, and even puzzling performance cliffs. Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and...




Publication date: 2015